AITopics | Phang Nga

Phang Nga

Can LLMs Help Create Grammar?: Automating Grammar Creation for Endangered Languages with In-Context Learning

Spencer, Piyapath T, Kongborrirak, Nanthipat

arXiv.org Artificial Intelligence, Dec-14-2024

Yes! In present-day efforts to document and preserve endangered languages, the application of Large Language Models (LLMs) presents a promising approach. This paper explores how LLMs, particularly through in-context learning, can assist in generating grammatical information for low-resource languages with limited amounts of data. We take Moklen as a case study to evaluate the efficacy of LLMs in producing coherent grammatical rules and lexical entries using only bilingual dictionaries and parallel sentences of the unknown language, without building the model from scratch. Our methodology involves organising the existing linguistic data and prompting the model so that it can efficiently generate a formal XLE grammar. Our results demonstrate that LLMs can successfully capture key grammatical structures and lexical information, although challenges such as the potential for English grammatical biases remain. This study highlights the potential of LLMs to enhance language documentation efforts, providing a cost-effective solution for generating linguistic data and contributing to the preservation of endangered languages.

large language model, machine learning, moklen, (19 more...)

2412.1096

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
North America > United States > Pennsylvania (0.04)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
(6 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges

Van Dinh, Nguyen, Dang, Thanh Chi, Nguyen, Luan Thanh, Van Nguyen, Kiet

arXiv.org Artificial Intelligence, Oct-4-2024

Vietnamese, a low-resource language, is typically categorized into three primary dialect groups that belong to Northern, Central, and Southern Vietnam. However, each province within these regions exhibits its own distinct pronunciation variations. Despite the existence of various speech recognition datasets, none of them has provided a fine-grained classification of the 63 dialects specific to individual provinces of Vietnam. To address this gap, we introduce the Vietnamese Multi-Dialect (ViMD) dataset, a novel comprehensive dataset capturing the rich diversity of 63 provincial dialects spoken across Vietnam. Our dataset comprises 102.56 hours of audio, consisting of approximately 19,000 utterances, and the associated transcripts contain over 1.2 million words. To provide benchmarks and simultaneously demonstrate the challenges of our dataset, we fine-tune state-of-the-art pre-trained models for two downstream tasks: (1) dialect identification and (2) speech recognition. The empirical results suggest two implications: the influence of geographical factors on dialects, and the constraints of current approaches in speech recognition tasks involving multi-dialect speech data. Our dataset is available for research purposes.

dataset, dialect, experiment, (17 more...)

2410.03458

Country:

Asia > Vietnam > Hanoi > Hanoi (0.14)
Asia > Vietnam > Thanh Hóa Province > Thanh Hóa (0.04)
Asia > Vietnam > Hưng Yên Province > Hưng Yên (0.04)
(65 more...)

Genre: Research Report > New Finding (0.66)

Industry: Transportation > Ground > Road (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Information Extraction based on Named Entity for Tourism Corpus

Chantrapornchai, Chantana, Tunsakul, Aphisit

arXiv.org Artificial Intelligence, Jan-3-2020

Tourism information is widely scattered. Searching for it is usually time-consuming: the user must browse through the results from a search engine, then select and view the details of each accommodation. In this paper, we present a methodology to extract particular information from the full text returned by the search engine to assist users. Users can then look specifically at the desired relevant information. The approach can be used for the same task in other domains. The main steps are 1) building training data and 2) building the recognition model. First, the tourism data is gathered and the vocabularies are built. The raw corpus is used to train the vocabulary embedding and to create annotated data. The process of creating named entity annotation is presented. Then, the recognition model for a given entity type can be built. In the experiments, given a hotel description, the model can extract the desired entities, i.e., name, location, and facility. The extracted data can further be stored as structured information, e.g., in an ontology format, for future querying and inference. The model for automatic named entity identification, based on machine learning, yields an error ranging from 8% to 25%.

information, ontology, raw corpus, (15 more...)

doi: 10.1109/JCSSE.2019.8864166

2001.01588

Country:

Asia > Thailand > Bangkok > Bangkok (0.05)
Europe > Austria > Vienna (0.05)
Europe > Austria > Tyrol > Innsbruck (0.04)
(14 more...)

Genre: Research Report (0.41)

Industry: Consumer Products & Services > Travel (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Semantic Search using Spreading Activation based on Ontology

Vuong, Ngo Minh

arXiv.org Artificial Intelligence, May-9-2019

Currently, text document retrieval systems face many challenges in exploring the semantics of queries and documents. Each query implies information that does not appear in the query itself, yet the user also expects documents related to that information. A disadvantage of previous spreading activation algorithms is that they may add many irrelevant concepts to the query. In this paper, we propose a novel algorithm that activates and adds to the query only the named entities that are related to the original entities and the explicit relations in the query.
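
The constraint described in this abstract can be illustrated with a minimal one-hop sketch. It is not the paper's algorithm: the ontology triples, the relation names, and the `expand_query` helper are all hypothetical, chosen only to show how activation can be limited to entities reached through relations that the query itself mentions.

```python
from collections import defaultdict

# Hypothetical ontology triples (subject, relation, object); illustrative only.
TRIPLES = [
    ("Phuket", "located_in", "Thailand"),
    ("Phang Nga", "located_in", "Thailand"),
    ("Phuket", "borders", "Phang Nga"),
    ("Bangkok", "capital_of", "Thailand"),
]


def build_index(triples):
    """Index triples by entity so neighbours can be looked up in both directions."""
    index = defaultdict(list)
    for subj, rel, obj in triples:
        index[subj].append((rel, obj))
        index[obj].append((rel, subj))
    return index


def expand_query(entities, relations, triples):
    """Return the query entities plus entities reached via query relations only.

    Assumption: a single activation step, with no weights or decay.
    """
    index = build_index(triples)
    expanded = list(entities)
    for entity in entities:
        for rel, neighbour in index[entity]:
            # Activate a neighbour only if the linking relation is in the query.
            if rel in relations and neighbour not in expanded:
                expanded.append(neighbour)
    return expanded
```

For example, `expand_query(["Thailand"], {"capital_of"}, TRIPLES)` adds `Bangkok` but leaves out `Phuket` and `Phang Nga`, because `located_in` is not among the query's relations. An unconstrained expansion would have added all three.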

artificial intelligence, natural language, ontology, (12 more...)

1905.06114

Country:

Europe > United Kingdom (0.06)
Asia > Southeast Asia (0.05)
Asia > Thailand > Phuket > Phuket (0.05)
(6 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Systems & Languages > Programming Languages (0.61)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.61)
Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (0.58)
